Before proceeding, it might be helpful to look over the help pages for getURL, fromJSON, ldply, xmlToList, read_html, html_nodes, html_table, readHTMLTable, and htmltab.
Moreover, please install and load the following libraries.
install.packages("RCurl")
library(RCurl)
install.packages("rjson")
library(rjson)
install.packages("XML")
library(XML)
install.packages("plyr")
library(plyr)
install.packages("rvest")
library(rvest)
install.packages("htmltab")
library(htmltab)
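Running install.packages on every session re-downloads packages unnecessarily. A common alternative (a sketch, not part of the original exercises; the helper name load_or_install is made up here) is to install only what is missing:

```r
# Hypothetical helper: install a package only if it is not already
# available, then attach it with library()
load_or_install <- function(pkgs) {
  for (p in pkgs) {
    if (!requireNamespace(p, quietly = TRUE)) {
      install.packages(p)
    }
    library(p, character.only = TRUE)
  }
}

# Usage:
# load_or_install(c("RCurl", "rjson", "XML", "plyr", "rvest", "htmltab"))
```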
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Retrieve the source of the web page “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-1/data.csv” and assign it to the object “url”.
####################
#                  #
#    Exercise 1    #
#                  #
####################

url <- getURL("https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-1/data.csv")
Exercise 2
Read the csv file and assign it to the “csv_file” object.
####################
#                  #
#    Exercise 2    #
#                  #
####################

csv_file <- read.csv(text = url)
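read.csv with the text argument parses CSV content held in a character string, which is exactly what getURL returns. A small offline sketch with made-up data:

```r
# read.csv(text = ...) parses a CSV string instead of a file path
csv_text <- "city,population\nAthens,664046\nRome,2873000"
df <- read.csv(text = csv_text)
nrow(df)   # 2
```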
Exercise 3
Do the same as exercise 1, but with the url: “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.txt” and then assign it to the “txt_file” object.
Note: it is a txt file, so you should use the appropriate function to import it.
####################
#                  #
#    Exercise 3    #
#                  #
####################

url <- getURL("https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.txt")
txt_file <- read.table(text = url)
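read.table works the same way for whitespace-delimited text; note that header = FALSE is the default, so when the first line holds column names you must say so explicitly. A minimal offline sketch with made-up data:

```r
# read.table(text = ...) parses whitespace-separated values from a string
txt <- "name score
alice 10
bob 12"
df <- read.table(text = txt, header = TRUE)
df$score   # 10 12
```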
Exercise 4
Do the same as exercise 1, but with the url: “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.json” and then assign it to the “json_file” object.
Note: it is a json file, so you should use the appropriate function to import it.
####################
#                  #
#    Exercise 4    #
#                  #
####################

url <- getURL("https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.json")
json_file <- fromJSON(url)
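rjson::fromJSON also accepts a JSON string directly and returns a nested R list, so you can experiment without a network connection. The sample data below is made up for illustration:

```r
library(rjson)

# fromJSON turns a JSON string into a nested R list
sample_json <- '{"name": "Athens", "population": 664046}'
parsed <- fromJSON(sample_json)
parsed$name         # "Athens"
parsed$population   # 664046
```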
Exercise 5
Do the same as exercise 1, but with the url: “https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.xml” and then assign it to the “xml_file” object.
Note: it is an xml file, so you should use the appropriate function to import it.
####################
#                  #
#    Exercise 5    #
#                  #
####################

url <- getURL("https://raw.githubusercontent.com/VasTsak/r-exercises-dw/master/part-2/data.xml")
xml_file <- ldply(xmlToList(url), data.frame)
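The xmlToList/ldply combination from the solution works on any XML string: xmlToList converts the parsed document into nested lists, and plyr::ldply then binds one data-frame row per list element. An offline sketch with made-up data (note the values come back as text, not numbers):

```r
library(XML)
library(plyr)

# xmlToList converts parsed XML into nested lists;
# ldply then builds one data-frame row per <city> element
sample_xml <- "<cities>
  <city><name>Athens</name><population>664046</population></city>
  <city><name>Rome</name><population>2873000</population></city>
</cities>"
xml_df <- ldply(xmlToList(sample_xml), data.frame)
```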
Exercise 6
We will now move on to web scraping. Read the html file “http://www.worldatlas.com/articles/largest-cities-in-europe-by-population.html” and assign it to the object “url”.
hint: consider using read_html
####################
#                  #
#    Exercise 6    #
#                  #
####################

url <- read_html("http://www.worldatlas.com/articles/largest-cities-in-europe-by-population.html")

Exercise 7
Select the “table” nodes from the html document you retrieved before.
hint: consider using html_nodes
####################
#                  #
#    Exercise 7    #
#                  #
####################

tbls <- html_nodes(url, "table")

Exercise 8
Convert the nodes you retrieved in exercise 7 to a list of data frames for processing.
hint: consider using html_table
####################
#                  #
#    Exercise 8    #
#                  #
####################

tbls_read <- url %>% html_nodes("table") %>% html_table(fill = TRUE)

Exercise 9
Let’s move to a faster and more straightforward function: retrieve the html document as you did in exercise 6 and convert it to a list of data frames using the function readHTMLTable.

####################
#                  #
#    Exercise 9    #
#                  #
####################

url <- "http://www.worldatlas.com/articles/largest-cities-in-europe-by-population.html"
tbls_xml <- readHTMLTable(url)

Exercise 10
This may be a bit tricky, but give it a try. Retrieve the html document as you did in exercise 6 and convert it into a data frame using the function htmltab.

####################
#                  #
#    Exercise 10   #
#                  #
####################

df_pop <- htmltab(doc = url, which = "//th[text() = 'Rank']/ancestor::table")
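The rvest pipeline from exercises 6–8 also works on HTML supplied as a literal string, which is handy for experimenting without a network connection. A minimal sketch with made-up data:

```r
library(rvest)

# read_html also accepts a literal HTML string
html_doc <- read_html(
  "<html><body><table>
     <tr><th>City</th><th>Population</th></tr>
     <tr><td>Paris</td><td>2140526</td></tr>
     <tr><td>Madrid</td><td>3165000</td></tr>
   </table></body></html>"
)

# Same pipeline as exercises 7 and 8: select the table nodes,
# then convert each one to a data frame
tbls <- html_doc %>% html_nodes("table") %>% html_table(fill = TRUE)
```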